from IPython.display import HTML
HTML('''
<script
src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js">
</script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
''')
Abstract
The research investigates the impact of the Oscars on the Wikipedia pageviews of related parties, including actors, production staff, and the films themselves. The study employs a methodology involving data preprocessing, exploration, and analysis using Spark to handle large datasets efficiently. Initial preprocessing converted Wikipedia pageview data into a more manageable parquet format. The analysis revealed a significant increase in pageviews for actors and films around the Oscars event, confirming the initial hypothesis of the Oscars' impact. However, the production staff did not experience the same level of recognition. The study highlights disparities in public attention and suggests areas for further research, such as integrating clickstream data to understand the sources of pageviews and employing cluster computing for more efficient data processing. This work underscores the value of distributed computing in handling extensive datasets and provides a foundation for future studies on the visibility and recognition of various contributors in the film industry.
Problem Statement
The Academy Awards, a prestigious celebration of cinematic excellence, is often perceived as a catalyst for heightened recognition and fame for the films, actors, and directors involved. However, the true extent of its impact remains a subject of inquiry. To delve into this question, we examine the effects of the 95th Academy Awards using Wikipedia page views as a proxy for public interest and attention.
By analyzing the changes in Wikipedia page views before and after the awards ceremony, we can gauge the magnitude of the "Oscar bump" and determine which categories and nominees experienced the most significant surges in public engagement. This analysis will provide valuable insights into the role of the Academy Awards in shaping public perception and generating interest in the film industry.
Motivation
The Academy Awards hold a prominent position in popular culture, shaping public discourse and influencing trends within the entertainment industry. A nomination or win can transform an actor, director, studio, and/or film from relative obscurity into a household name.
An Academy win often coincides with a rise to stardom, increased marketing traction, and heightened cultural influence. Understanding these concepts is crucial for both industry insiders and fans alike. However, there is currently no comprehensive method to understand the trajectory of an artist or film post-Academy Awards win.
In today’s digital era, these behaviors are often reflected online. Avid fans or curious spectators frequently search for details about winning films. Most, if not all, land on one well-known page – Wikipedia. The website holds a wealth of information on a wide range of topics, including the intricate histories and achievements of films and their creators.
Analyzing Wikipedia page views could provide valuable insights. As a significant source of big data, Wikipedia offers detailed statistics on page visits that reflect public interest and engagement. By examining patterns in page views before and after the Academy Awards, researchers can identify how winning impacts public awareness and interest in a film or artist.
Democratizing this information would allow future analysts, critics, and fans to have a clearer understanding of how the Academy Awards influence the entertainment industry. Making data readily accessible can foster a more inclusive and well-informed public discourse around the significance of these awards. This democratization also underscores how the Academy Awards are not just about Hollywood’s glamour but also about recognizing and celebrating the hard work, creativity, and talent of artists.
Scope and Limitations
The study's analysis focuses on the winners of The 95th Academy Awards (Oscars 2023); hence, the pageview data used spans 2022 to 2023. This span is considered relevant for the event since most of the films included were released in 2022 and the awards concluded in the first quarter of 2023.
Pageview data will be analyzed for a period before, during, and after the announcement of nominations and winners to identify trends and patterns.
The study will be limited based on the following:
- Due to differences in titles across languages, only the English Wikipedia will be considered in the analysis.
- Titles of Oscars 2023-specific pages will be manually retrieved from the Oscars 2023 Wikipedia Page. As page titles change over time, the filtered pages will be limited based on the date of retrieval.
- The study may be limited by which files are successfully read from the Jojie Public Dataset.
- The study is limited to Oscars 2023. Findings and insights generated from this study may not apply directly to other years.
- Other external factors such as concurrent awards, news coverage, and marketing campaigns will not be considered in this study but may have an effect on the result.
Data Source
The pageview complete dumps can be found on Wikimedia’s dumps page maintained by its analytics team. It’s a comprehensive timeseries of pageview data on a per-article basis of Wikimedia projects such as the English Wikipedia, Wikibooks, and many others. Dumps from December 2007 up to the present are formatted similarly and compressed into a bzip (.bz2) file.
Each dump file contains one line of text per page per day, carrying the dataset’s features:
- Wiki code: The code identifying the specific wiki project.
- Article title: The name of the article viewed.
- Page id: A unique identifier for each article.
- Mode: This denotes the platform through which the page was viewed, such as desktop or mobile.
- Daily total: The total number of views the article received in one day.
- Hourly counts: The number of views per hour within a day.
Furthermore, the hourly counts are formatted as a string of letter-number pairs, which can be deciphered as follows:
| Letter | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Equivalent Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
while the number following each letter corresponds to the pageviews for that hour.
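Putting the field list and the hour code together, a single dump line can be parsed with plain Python. The sample line, page id, and counts below are made up for illustration:

```python
import re

def parse_line(line):
    """Split one pageview-complete dump line into its six fields."""
    project, title, pageid, mode, daily, hourly = line.rsplit(" ", 5)
    return {"project": project, "title": title, "pageid": pageid,
            "mode": mode, "daily_count": int(daily),
            "hourly_counts": hourly}

def decode_hourly(hourly):
    """Map letter-number pairs to {hour: views}; 'A' is hour 0, 'X' hour 23."""
    return {ord(letter) - ord("A"): int(count)
            for letter, count in re.findall(r"([A-X])(\d+)", hourly)}

# Illustrative line (the page id and counts are made up):
row = parse_line("en.wikipedia Top_Gun:_Maverick 393629 desktop 10 A3B2X5")
hours = decode_hourly(row["hourly_counts"])
# hours == {0: 3, 1: 2, 23: 5}, and the hourly counts sum to the daily total
assert sum(hours.values()) == row["daily_count"]
```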
For this project, pageviews dumps from 2022 to 2023 will be used considering the timeline of The 95th Academy Awards. The total size of the files to be loaded is shown below.
!du -sch /mnt/data/public/wikipedia/pageviews/pageview_complete/202[23]
368G	/mnt/data/public/wikipedia/pageviews/pageview_complete/2022
311G	/mnt/data/public/wikipedia/pageviews/pageview_complete/2023
678G	total
Methodology
Figure 1 shows the steps done in this study:
Figure 1. Methodology Overview
Table 1 describes each step shown in the figure above.
Table 1. Methodology Steps and Description
| Step | Name | Description |
|---|---|---|
| 1 | Initial Data Preprocessing | Convert Wikipedia pageview data to parquet |
| 2 | Data Exploration on the Wikipedia Pageviews Dataset | Explore the Wikipedia Pageviews dataset |
| 3 | Focused Data Preprocessing | Filter and Convert Oscars-specific pageview data to parquet |
| 4 | Data Analysis | Analyze the Pageviews data for Oscars-specific pages |
Initial Data Preprocessing
Steps done within this section were executed using the notebook Process Wikipedia Pageviews.ipynb. This notebook should be run from top to bottom before proceeding.
Wikipedia pageview data are stored as text files compressed in bz2 format. In this format, Spark must scan every line of each file on every query, which is slow for repeated access.
For easier usage, each of the pageview dump files was loaded in a Spark session and converted to parquet, a columnar format that is much faster to read back. To avoid additional overhead, only the filename was appended to each row so that the date can be recovered later. The parquet files are then stored in their specific folders.
import os
os.environ['PYARROW_IGNORE_TIMEZONE'] = '1'
import warnings
warnings.filterwarnings('ignore')
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from project_functions import *
spark = (SparkSession
.builder
.master('local[*]')
.config('spark.sql.execution.arrow.pyspark.enabled', 'true')
.getOrCreate())
spark.sparkContext.setLogLevel('OFF')
sdf1 = spark.read.parquet("pageviews_parquet/*")
sdf2 = spark.read.parquet('pageviews_parquet_repeat/*')
sdf = sdf1.union(sdf2)
wiki = (sdf
.select(
F.col('_c0').alias('project'),
F.col('_c1').alias('title'),
F.col('_c2').alias('pageid'),
F.col('_c3').alias('mode'),
F.col('_c4').cast('int').alias('daily_count'),
F.col('_c5').alias('hourly_counts'),
F.to_date(F.regexp_substr(F.col('filename'),
F.lit(r"\b\d{8}\b")
), 'yyyyMMdd').alias('date')
))
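The date column above is recovered from an 8-digit token in each source filename. The same regular expression behaves as follows in plain Python (the filename here is illustrative, not an actual dump name):

```python
import re
from datetime import datetime

# Illustrative filename; the real dump names embed a yyyymmdd date token.
fname = "pageviews-20230312-user.bz2"

# \b\d{8}\b matches a standalone run of exactly eight digits.
match = re.search(r"\b\d{8}\b", fname)
date = datetime.strptime(match.group(), "%Y%m%d").date()
```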
Once the parquet files have been written, they are loaded into a Spark session. A sample of the loaded parquet files for Wikipedia pageviews from 2022 to 2023 is shown below in Table 2.
Table 2. Sample Wikipedia Pageviews Data
pd.DataFrame(
wiki.limit(5).collect(),
columns=[
"project",
"title",
"pageid",
"mode",
"daily_count",
"hourly_counts",
"date"
]
)
| | project | title | pageid | mode | daily_count | hourly_counts | date |
|---|---|---|---|---|---|---|---|
| 0 | aa.wikibooks | - | null | mobile-web | 1 | N1 | 2023-09-23 |
| 1 | aa.wikibooks | Main_Page | null | desktop | 61 | B3C5D3E6G2H3I3J2K1L5N4P2Q4R1S4T4U3V2W4 | 2023-09-23 |
| 2 | aa.wikibooks | Special:Log/Alkab | null | desktop | 1 | F1 | 2023-09-23 |
| 3 | aa.wikibooks | Template:Wikitext_talk_page_converted_to_Flow | null | desktop | 1 | A1 | 2023-09-23 |
| 4 | aa.wikibooks | User:CptViraj | null | desktop | 1 | A1 | 2023-09-23 |
The data types for each column of the Wikipedia Pageviews dataset are shown in Table 3.
Table 3. Data Types of the Wikipedia Pageviews Dataset
pd.DataFrame(wiki.dtypes,
columns=['Column', 'Data Type'])
| | Column | Data Type |
|---|---|---|
| 0 | project | string |
| 1 | title | string |
| 2 | pageid | string |
| 3 | mode | string |
| 4 | daily_count | int |
| 5 | hourly_counts | string |
| 6 | date | date |
We first examine the total number of rows in the entire dataset prior to filtering.
print(f"The total number of rows in the dataset is {wiki.count():,}.")
The total number of rows in the dataset is 27,226,845,264.
We also examine the total daily view count across the entire dataset.
print("The total views in the 2022 to 2023 Wikipedia dataset is:")
wiki.agg(F.sum('daily_count')).collect()
The total views in the 2022 to 2023 Wikipedia dataset is:
[Row(sum(daily_count)=328436025442)]
Focused Data Preprocessing
Steps done within this section were executed using the notebook Process Oscars Parquet.ipynb. This notebook should be run from top to bottom before proceeding.
Based on the data exploration done, we then focus our study on Oscars 2023-specific pages. To do so, we limit the dataset to the English Wikipedia by filtering the project column to 'en.wikipedia' for easier analysis. We also filter the articles to be analyzed by retrieving the page titles of the winning entities of The 95th Academy Awards as listed in its Wikipedia article.
The filtered Wikipedia articles were then written to parquet again as a checkpoint for easier access. A sample of the filtered Oscars dataset can be found below in Table 4.
Table 4. Sample Oscars Pageview Data
oscars_daily = spark.read.parquet('oscars').toPandas()
oscars_daily
| | title | pageid | mode | daily_count | hourly_counts | date |
|---|---|---|---|---|---|---|
| 0 | Daniel_Barrett_(visual_effects_supervisor) | 34496961 | desktop | 1.0 | I1 | 2023-09-23 |
| 1 | A24 | 38837739 | mobile-web | 2406.0 | A97B105C121D120E86F84G74H67I64J60K54L75M73N83O... | 2023-06-27 |
| 2 | Black_Panther:_Wakanda_Forever | null | mobile-app | 418.0 | A19B26C31D20E16F16G20H15I26J16K11L16M16N13O15P... | 2023-06-13 |
| 3 | Avatar:_The_Way_of_Water | 25813358 | desktop | 4160.0 | A157B136C155D164E149F146G153H156I123J137K141L1... | 2023-05-21 |
| 4 | A24 | 38837739 | desktop | 1900.0 | A64B66C62D77E65F85G76H68I53J72K52L42M58N81O130... | 2023-08-22 |
| ... | ... | ... | ... | ... | ... | ... |
| 67650 | A24 | 38837739 | mobile-web | 3199.0 | A130B157C164D162E124F142G85H104I100J67K80L93M1... | 2022-06-04 |
| 67651 | Brendan_Fraser | 386491 | desktop | 1426.0 | A57B63C38D49E46F37G48H49I52J68K72L77M61N58O67P... | 2022-06-04 |
| 67652 | Chandrabose_(lyricist) | 8390040 | mobile-web | 60.0 | C3D2E4F4G4H3I2J1L3M3O1P4Q9R4S7T3U1V2 | 2022-07-09 |
| 67653 | A24 | 472347 | mobile-web | NaN | None | 2022-02-02 |
| 67654 | Guillermo_del_Toro's_Pinocchio | 62106165 | desktop | NaN | None | 2022-02-02 |
67655 rows × 6 columns
As a sanity check, duplicates were inspected and none were found, meaning each row in the dataset is distinct. Table 5 shows the number of rows remaining after dropping duplicates, which equals the original row count.
Table 5. Length of dataset after dropping duplicates
pd.DataFrame([['oscars_daily.drop_duplicates()',
len(oscars_daily.drop_duplicates())]],
columns=['Action', 'Result'])
| | Action | Result |
|---|---|---|
| 0 | oscars_daily.drop_duplicates() | 67655 |
Table 6 shows that null data is present in the daily_count and hourly_counts columns. This may be due to errors in the reading process.
Table 6. Oscars Data Information
df_info_to_dataframe(oscars_daily)
| | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | title | 67655 non-null | object |
| 1 | pageid | 67655 non-null | object |
| 2 | mode | 67655 non-null | object |
| 3 | daily_count | 67510 non-null | float64 |
| 4 | hourly_counts | 67434 non-null | object |
| 5 | date | 67655 non-null | object |
To understand the severity of the null data points, we check which dates are affected.
oscars_daily.iloc[
np.where(np.isnan(oscars_daily['daily_count']))[0]
].date.unique()
array([datetime.date(2022, 2, 2), datetime.date(2022, 1, 26)],
dtype=object)
oscars_daily[oscars_daily.hourly_counts.isna()].date.unique()
array([datetime.date(2022, 1, 30), datetime.date(2022, 2, 2),
datetime.date(2022, 1, 26)], dtype=object)
Three dates were affected, namely Jan 26, Jan 30, and Feb 2, 2022. All of these fall at the beginning of the dataset, fairly remote from the dates significant to the Oscars 2023 event.
To fix this, null values in the daily_count column will be filled with zeroes, as these dates do not affect the analysis. A separate DataFrame for daily data will be made that excludes the hourly_counts column. For the hourly counts, another DataFrame will be made that focuses on that column; its null values will be dropped rather than derived from daily_count, since the hourly distribution is unknown. This way, the distribution of counts per hour is not distorted in the analysis.
oscars_daily_clean = (oscars_daily
.drop(columns='hourly_counts')
.fillna(0))
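The fill-and-split just described can be checked on a toy frame whose columns follow the dataset (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the Oscars dataset's columns; values are illustrative.
toy = pd.DataFrame({
    "title": ["A24", "RRR", "A24"],
    "daily_count": [1900.0, np.nan, 60.0],
    "hourly_counts": ["A64B66", None, "C3D2"],
    "date": ["2023-08-22", "2022-02-02", "2022-07-09"],
})

# Daily view: drop hourly_counts and fill missing daily counts with zero.
toy_daily = toy.drop(columns="hourly_counts").fillna(0)

# Hourly view: drop rows with no hourly breakdown instead of imputing them,
# so the per-hour distribution is not distorted.
toy_hourly = toy.drop(columns="daily_count").dropna(subset=["hourly_counts"])
```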
Additional columns will also be added to the dataset. The date column will be converted to datetime, from which the month, year, and day_of_week columns will be derived.
The entity denoted by each page title will also be categorized into Movies, Actors, Production, or Others, which will be used in the analysis. Each of these columns has a boolean data type, and a True value indicates that the entity belongs to that category.
A sample of the dataset that will be used for analysis can be found below in Table X.
Table X. Sample of the Dataset used for Analysis
oscars_daily_clean['date'] = pd.to_datetime(oscars_daily_clean.date)
oscars_daily_clean['year'] = oscars_daily_clean.date.dt.year
oscars_daily_clean['month'] = oscars_daily_clean.date.dt.month
oscars_daily_clean['day_of_week'] = oscars_daily_clean.date.dt.dayofweek
category_mapping = {
"95th_Academy_Awards": "Others",
"Everything_Everywhere_All_at_Once": "Movies",
"Michelle_Yeoh": "Actors",
"Ke_Huy_Quan": "Actors",
"Jamie_Lee_Curtis": "Actors",
"A24": "Production",
"All_Quiet_on_the_Western_Front_(2022_film)": "Movies",
"The_Whale_(2022_film)": "Movies",
"Avatar:_The_Way_of_Water": "Movies",
"Black_Panther:_Wakanda_Forever": "Movies",
"The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film)": "Movies",
"The_Elephant_Whisperers": "Movies",
"Guillermo_del_Toro's_Pinocchio": "Movies",
"An_Irish_Goodbye": "Movies",
"Navalny_(film)": "Movies",
"RRR": "Movies",
"Top_Gun:_Maverick": "Movies",
"Women_Talking_(film)": "Movies",
"Daniels_(directors)": "Production",
"Brendan_Fraser": "Actors",
"Hauschka": "Production",
"Charlie_Mackesy": "Production",
"M._M._Keeravani": "Production",
"Sarah_Polley": "Production",
"Miriam_Toews": "Production",
"Guillermo_del_Toro": "Production",
"Mark_Gustafson": "Production",
"Alex_Bulkley": "Production",
"Edward_Berger": "Production",
"Daniel_Roher": "Production",
"Odessa_Rae": "Production",
"Shane_Boris": "Production",
"Kartiki_Gonsalves": "Production",
"Guneet_Monga": "Production",
"Chandrabose_(lyricist)": "Production",
"James_Mather_(sound_editor)": "Production",
"Al_Nelson_(sound_engineer)": "Production",
"Chris_Burdon": "Production",
"Christian_M._Goldbeck": "Production",
"Ernestine_Hipper": "Production",
"James_Friend": "Production",
"Adrien_Morot": "Production",
"Judy_Chin": "Production",
"Annemarie_Bradley": "Production",
"Ruth_E._Carter": "Production",
"Paul_Rogers_(film_editor)": "Production",
"Joe_Letteri": "Production",
"Richard_Baneham": "Production",
"Eric_Saindon": "Production",
"Daniel_Barrett_(visual_effects_supervisor)": "Production"
}
# Add a new "category" column to the DataFrame based on the mapping
# (the title values must match the keys in category_mapping)
oscars_daily_clean["category"] = (oscars_daily_clean["title"]
.map(category_mapping))
oscars_daily_clean = pd.merge(oscars_daily_clean,
pd.get_dummies(oscars_daily_clean.category),
left_index=True, right_index=True)
oscars_daily_clean
| | title | pageid | mode | daily_count | date | year | month | day_of_week | category | Actors | Movies | Others | Production |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Daniel_Barrett_(visual_effects_supervisor) | 34496961 | desktop | 1.0 | 2023-09-23 | 2023 | 9 | 5 | Production | False | False | False | True |
| 1 | A24 | 38837739 | mobile-web | 2406.0 | 2023-06-27 | 2023 | 6 | 1 | Production | False | False | False | True |
| 2 | Black_Panther:_Wakanda_Forever | null | mobile-app | 418.0 | 2023-06-13 | 2023 | 6 | 1 | Movies | False | True | False | False |
| 3 | Avatar:_The_Way_of_Water | 25813358 | desktop | 4160.0 | 2023-05-21 | 2023 | 5 | 6 | Movies | False | True | False | False |
| 4 | A24 | 38837739 | desktop | 1900.0 | 2023-08-22 | 2023 | 8 | 1 | Production | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67650 | A24 | 38837739 | mobile-web | 3199.0 | 2022-06-04 | 2022 | 6 | 5 | Production | False | False | False | True |
| 67651 | Brendan_Fraser | 386491 | desktop | 1426.0 | 2022-06-04 | 2022 | 6 | 5 | Actors | True | False | False | False |
| 67652 | Chandrabose_(lyricist) | 8390040 | mobile-web | 60.0 | 2022-07-09 | 2022 | 7 | 5 | Production | False | False | False | True |
| 67653 | A24 | 472347 | mobile-web | 0.0 | 2022-02-02 | 2022 | 2 | 2 | Production | False | False | False | True |
| 67654 | Guillermo_del_Toro's_Pinocchio | 62106165 | desktop | 0.0 | 2022-02-02 | 2022 | 2 | 2 | Movies | False | True | False | False |
67655 rows × 13 columns
With the preprocessing done, we get the total daily count and the number of unique titles per year-month to see which months lack data (see Table 7).
Table 7. Total Daily Count and Count of Unique Titles per Year-Month
oscars_daily_clean.groupby(['year', 'month']).agg({'title':'nunique',
'daily_count': 'sum'})
| year | month | title (nunique) | daily_count (sum) |
|---|---|---|---|
| 2022 | 1 | 33 | 1458271.0 |
| 2022 | 2 | 33 | 876168.0 |
| 2022 | 3 | 34 | 1996419.0 |
| 2022 | 4 | 33 | 2087088.0 |
| 2022 | 5 | 35 | 4129596.0 |
| 2022 | 6 | 34 | 3621363.0 |
| 2022 | 7 | 36 | 3131939.0 |
| 2022 | 8 | 36 | 3505628.0 |
| 2022 | 9 | 37 | 7708977.0 |
| 2022 | 10 | 36 | 4814380.0 |
| 2022 | 11 | 40 | 11675821.0 |
| 2022 | 12 | 39 | 17995389.0 |
| 2023 | 1 | 40 | 19711419.0 |
| 2023 | 2 | 40 | 11565567.0 |
| 2023 | 3 | 50 | 31794293.0 |
| 2023 | 4 | 50 | 5805783.0 |
| 2023 | 5 | 49 | 4024597.0 |
| 2023 | 6 | 49 | 4554705.0 |
| 2023 | 7 | 49 | 4157075.0 |
| 2023 | 8 | 48 | 3444620.0 |
| 2023 | 9 | 49 | 2905820.0 |
| 2023 | 10 | 49 | 3001665.0 |
| 2023 | 11 | 49 | 2861168.0 |
| 2023 | 12 | 49 | 3475458.0 |
Aside from data reading errors, some pages may have been created within the selected date range, or their titles (which we used to filter the dataset) may have changed over time. These are some reasons why there are fewer unique page titles in 2022 than in 2023. However, for the awarding of The 95th Academy Awards, held in March 2023, more data points were available, so the analysis can still proceed.
To ensure that no changes affect the original dataframe, a copy of oscars_daily_clean is stored in df_daily_eda. This serves as the dataframe used in the analysis section.
df_daily_eda = oscars_daily_clean.copy()
The new dataframe is then grouped by title and its view counts aggregated. This removes the repetition of titles across view modes and pageid, summing the total view count per title instead.
# List of columns to exclude
exclude_columns = ['pageid', 'mode', 'daily_count']
# Create a list of column titles, excluding the specified columns
titles = [x for x in df_daily_eda.columns if x not in exclude_columns]
df_daily_eda = (
df_daily_eda
.groupby(titles, as_index=False)
.agg({"daily_count": "sum"})
)
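The effect of this aggregation can be checked on a toy frame (values are illustrative): rows that differ only in mode (or, in the real data, pageid) collapse into a single row whose counts are summed.

```python
import pandas as pd

# Toy frame mirroring the dataset's columns; values are made up.
toy = pd.DataFrame({
    "title": ["A24", "A24", "RRR"],
    "date": ["2023-03-12", "2023-03-12", "2023-03-12"],
    "mode": ["desktop", "mobile-web", "desktop"],
    "daily_count": [1900, 2406, 500],
})

# Group on every column except the ones being collapsed, summing the counts.
exclude_columns = ["mode", "daily_count"]
keys = [c for c in toy.columns if c not in exclude_columns]
collapsed = toy.groupby(keys, as_index=False).agg({"daily_count": "sum"})
# A24's desktop and mobile-web rows merge into one row of 4306 total views.
```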
With the data more focused, it was examined through plots, primarily with Plotly given its interactive nature. To start, the distribution of the data was examined via a bar plot, as seen below in Fig. 1.
Fig. 1. Total Dates Covered per Title
#Aggregating data for the "title" column
title_counts = df_daily_eda['title'].value_counts().reset_index()
title_counts.columns = ['title', 'count']
# Aggregating data for the "category" column
category_counts = df_daily_eda['category'].value_counts().reset_index()
category_counts.columns = ['category', 'count']
# Plot the Title distribution
title_fig = px.bar(title_counts,
x='title',
y='count',
title='Title Distribution',
labels={'count': 'Frequency'})
title_fig.update_layout(xaxis={'categoryorder': 'total descending'},
yaxis={'automargin': True,
'range': [0, title_counts['count'].max() + 200]})
# Display plots
title_fig.show()
The plot above shows the distribution of the titles. Ideally there should be around 638 instances per title; however, the shape and actual counts show that this is not the case for all. The floor is Annemarie_Bradley, with only around 8 instances.
Next, a box plot was made to check the distribution of views across the titles as seen in Fig. 2 below.
Fig. 2. Distribution of Pageviews per Title
# Create the box plot
fig_box = px.box(df_daily_eda, x="title", y="daily_count", notched=False)
fig_box.update_layout(
title='Distribution of Views per Title',
xaxis_title='Title',
yaxis_title='Views',
xaxis={
'categoryorder': 'total descending',
'tickangle': 45,  # Rotate labels by 45 degrees
'tickfont': {'size': 10} # Set font size to 10
},
yaxis_type="log",
showlegend=False,
height=1200,
width=800,
margin=dict(b=200),
)
# Display the updated figure
fig_box.show()
Due to the wide range of view counts, the y-axis is plotted on a logarithmic scale for readability. Plotly's interactivity allows an examination of the quartiles, median, and extremes of each title's views.
For example, Avatar:_The_Way_of_Water, which has the highest daily views overall, has a maximum of around 792.89k daily views and a median of 16.502k. This gap suggests periods where views spiked sharply. The other plots also suggest this, especially those on the left side, which tended to be titles in the Actors or Movies categories.
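The per-title statistics that Plotly surfaces interactively can also be computed directly with a groupby; a sketch using toy values that echo the figures quoted above:

```python
import pandas as pd

# Toy values echoing the quoted max/median figures, for illustration only.
toy = pd.DataFrame({
    "title": ["Avatar", "Avatar", "Avatar", "RRR"],
    "daily_count": [16502, 792890, 12000, 3000],
})

# Per-title summary statistics, as the box plot shows on hover.
stats = toy.groupby("title")["daily_count"].agg(["median", "max"])

# A large max-to-median ratio flags titles with short-lived view spikes.
spike_ratio = stats["max"] / stats["median"]
```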
Next we check for any correlations between categories, as seen in Fig. 3 below.
Fig. 3. Correlation Matrix among Features
# Select the numeric and boolean columns of df_daily_eda for correlation
numeric_df = df_daily_eda[["Actors", "Movies", "Production", "Others",
"month", "day_of_week", "daily_count", "year"]]
# Now calculate correlation on numeric DataFrame
correlation_matrix = numeric_df.corr()
plt.figure(figsize=(12, 8))
# Use seaborn to create a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', fmt=".2f",
linewidths=.5, cbar_kws={"shrink": .8})
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.title('Correlation Matrix')
plt.show()
From the outset, the Production category was strongly negatively correlated (-0.75) with the Movies category. Since each page belongs to exactly one category, these dummy columns are mutually exclusive, so the pageview data of one tends to exclude the other.
There are other notable relationships. Movies and Actors show a weak negative correlation (-0.20), again reflecting that a page in one category cannot also be in the other.
In terms of positive relationships, both Movies and Actors had slightly positive correlations with daily_count, suggesting that pages in these categories tend to receive more views overall.
Given the volume of data and its temporal nature, the next step was to examine the trend of daily_count, the number of views per page. To this end, a timeseries graph (Fig. 4) was created using Plotly's line plot.
Fig. 4. Timeseries Plot for Daily Views per Page
actors = np.ravel(
[x for x in df_daily_eda.groupby('title')['Actors'].unique()]
)
movies = np.ravel(
[x for x in df_daily_eda.groupby('title')['Movies'].unique()]
)
production = np.ravel(
[x for x in df_daily_eda.groupby('title')['Production'].unique()]
)
others = np.ravel(
[x for x in df_daily_eda.groupby('title')['Others'].unique()]
)
df_daily_eda = df_daily_eda.sort_values(by='date')
# Ensure date is a datetime for time-based grouping
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])
# Group by day
df_daily = df_daily_eda.groupby(
['date', 'title', 'category']
)['daily_count'].sum().reset_index()
# Group by month
df_daily_eda['month'] = df_daily_eda['date'].dt.to_period('M')
df_monthly = df_daily_eda.groupby(
['month', 'title', 'category']
)['daily_count'].sum().reset_index()
df_monthly['month'] = df_monthly['month'].dt.to_timestamp()
# Create a figure
fig_line = go.Figure()
# Get the unique categories
categories = df_daily['category'].unique()
# Sort titles alphabetically
titles = sorted(df_daily['title'].unique())
# Add daily traces for each unique title
for title in titles:
title_data = df_daily[df_daily['title'] == title]
fig_line.add_trace(go.Scatter(
x=title_data['date'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=True, # Initially, make daily traces visible
# legendgroup='daily',
hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
# Add monthly traces for each unique title (initially hidden)
for title in titles:
title_data = df_monthly[df_monthly['title'] == title]
fig_line.add_trace(go.Scatter(
x=title_data['month'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=False, # Initially, make monthly traces hidden
# legendgroup='monthly',
hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
# Create buttons for dropdown menu to filter by category
buttons = []
for category in categories:
category_titles = sorted(
df_daily[df_daily['category'] == category]['title'].unique()
)
visible_daily = ([title in category_titles for title in titles] +
[False] * len(titles))
visible_monthly = ([False] * len(titles) +
[title in category_titles for title in titles])
buttons.append(dict(
label=f'{category} (Daily)',
method='update',
args=[{'visible': visible_daily},
{'title': (f'Views of Titles Over Time - Category: {category}'
' (Daily)')}]
))
buttons.append(dict(
label=f'{category} (Monthly)',
method='update',
args=[{'visible': visible_monthly},
{'title': (f'Views of Titles Over Time - Category: {category}'
' (Monthly)')}]
))
# Add a button to show all titles (daily)
buttons.append(dict(
label='All (Daily)',
method='update',
args=[{'visible': [True] * len(titles) + [False] * len(titles)},
{'title': 'Views of Titles Over Time - All Categories (Daily)'}]
))
# Add a button to show all titles (monthly)
buttons.append(dict(
label='All (Monthly)',
method='update',
args=[{'visible': [False] * len(titles) + [True] * len(titles)},
{'title': 'Views of Titles Over Time - All Categories (Monthly)'}]
))
# Update layout with dropdown menu
fig_line.update_layout(
title='Views of Titles Over Time - All Categories (Daily)',
xaxis_title='Date',
yaxis_title='Views',
# yaxis=dict(type='log'), #comment out for viewing in normal scale
updatemenus=[dict(
active=0,
buttons=buttons,
x=1.15,
y=1.15
)],
template='plotly_white',
legend=dict(traceorder='normal') # Ensure the legend is sorted
)
# Show the figure
fig_line.show()
Exploration of the above graph yielded several observations.
- Movies and Actors were the most significantly affected by the spike in pageviews on the day of the Oscars in March 2023, but there was a spike across all titles.
- Some Production titles only made their appearance in March 2023, during the Oscars period.
- Avatar: The Way of Water had the highest views in a single day, on Dec 1, 2022.
- Everything Everywhere All at Once had the highest views during the Oscars month of March 2023.
- The Whale had slightly lower but similar levels of engagement during the Oscars and during its premiere the previous year, on September 5, 2022.
A log-scaled version of Fig. 4 is shown below (Fig. 5).
Fig. 5. Timeseries Plot of Daily Views per Page (Log-scale)
# Update layout with dropdown menu
fig_line.update_layout(
title='Views of Titles Over Time - All Categories (Daily)',
xaxis_title='Date',
yaxis_title='Views',
yaxis=dict(type='log'), #comment out for viewing in normal scale
updatemenus=[dict(
active=0,
buttons=buttons,
x=1.15,
y=1.15
)],
template='plotly_white',
legend=dict(traceorder='normal') # Ensure the legend is sorted
)
# Show the figure
fig_line.show()
The modular graph above allowed a thorough examination and comparison of the trends of pageviews between and within categories.
Fig. 6. Top Titles for 2022 and 2023
# Find the top titles by daily_count for each year
top_titles_hist = df_daily_eda.sort_values(
by="daily_count", ascending=False
).groupby("year").apply(lambda x: x.nlargest(10, 'daily_count'))
# Create a Plotly histogram
fig_hist = px.histogram(
top_titles_hist, x="year", y="daily_count", color="title",
title="Top Titles by Daily Count Each Year",
labels={"daily_count": "Daily Count", "title": "Title"}, barmode="group")
# Show the plot
fig_hist.show()
The bar graphs in Fig. 6 support the observation that the spikes concern the movies and actors much more than the production staff.
Fig. 7. Timeseries Graph of The Whale-related pages
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])
# Filter the data for "The Whale" group; copy to avoid a
# SettingWithCopyWarning on the 'month' assignment below
whale_titles = ['Adrien_Morot', 'Judy_Chin', 'Annemarie_Bradley',
'Brendan_Fraser', 'The_Whale_(2022_film)']
df_whale = df_daily_eda[df_daily_eda['title'].isin(whale_titles)].copy()
# Group by day
df_daily_whale = df_whale.groupby(['date', 'title', 'category']
)['daily_count'].sum().reset_index()
# Group by month
df_whale['month'] = df_whale['date'].dt.to_period('M')
df_monthly_whale = df_whale.groupby(['month', 'title', 'category']
)['daily_count'].sum().reset_index()
df_monthly_whale['month'] = df_monthly_whale['month'].dt.to_timestamp()
# Create a figure
fig_whale = go.Figure()
# Sort titles alphabetically
titles_whale = sorted(df_daily_whale['title'].unique())
# Add daily traces for each unique title in "The Whale"
for title in titles_whale:
title_data = df_daily_whale[df_daily_whale['title'] == title]
fig_whale.add_trace(go.Scatter(
x=title_data['date'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=True, # Initially, make daily traces visible
hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
# Add monthly traces for each unique title in "The Whale" (initially hidden)
for title in titles_whale:
title_data = df_monthly_whale[df_monthly_whale['title'] == title]
fig_whale.add_trace(go.Scatter(
x=title_data['month'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=False, # Initially, make monthly traces hidden
hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
# Create buttons for dropdown menu to filter by daily and monthly data
buttons_whale = [
dict(
label='Daily',
method='update',
args=[{'visible': ([True] * len(titles_whale) + [False] *
len(titles_whale))},
{'title': 'Views of Titles Over Time - The Whale (Daily)',
'yaxis': {'type': 'log'}}]
),
dict(
label='Monthly',
method='update',
args=[{'visible': ([False] * len(titles_whale) + [True] *
len(titles_whale))},
{'title': 'Views of Titles Over Time - The Whale (Monthly)',
'yaxis': {'type': 'linear'}}]
)
]
# Update layout with dropdown menu
fig_whale.update_layout(
title='Views of Titles Over Time - The Whale (Daily)',
xaxis_title='Date',
yaxis_title='Views',
yaxis=dict(type='log'), # Set initial y-axis type to log for daily data
updatemenus=[dict(
active=0,
buttons=buttons_whale,
x=1.15,
y=1.15
)],
template='plotly_white',
legend=dict(traceorder='normal') # Ensure the legend is sorted
)
# Show the figure
fig_whale.show()
Fig. 8. Timeseries Graph of Everything Everywhere All at Once-related pages
import pandas as pd
import plotly.graph_objects as go
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])
# Filter the data for "Everything Everywhere All at Once" group; copy to
# avoid a SettingWithCopyWarning on the 'month' assignment below.
# Note: the disambiguated title 'Paul_Rogers_(film_editor)' matches the
# page title that appears in the dataset (see Table 10).
eeaao_titles = ["Ke_Huy_Quan", "Michelle_Yeoh", "Jamie_Lee_Curtis",
"Everything_Everywhere_All_at_Once", "Paul_Rogers_(film_editor)"]
df_eeaao = df_daily_eda[df_daily_eda['title'].isin(eeaao_titles)].copy()
# Group by day
df_daily_eeaao = df_eeaao.groupby(['date', 'title', 'category']
)['daily_count'].sum().reset_index()
# Group by month
df_eeaao['month'] = df_eeaao['date'].dt.to_period('M')
df_monthly_eeaao = df_eeaao.groupby(['month', 'title', 'category']
)['daily_count'].sum().reset_index()
df_monthly_eeaao['month'] = df_monthly_eeaao['month'].dt.to_timestamp()
# Create a figure
fig_eeaao = go.Figure()
# Sort titles alphabetically
titles_eeaao = sorted(df_daily_eeaao['title'].unique())
# Add daily traces for each unique title in "Everything Everywhere All at Once"
for title in titles_eeaao:
title_data = df_daily_eeaao[df_daily_eeaao['title'] == title]
fig_eeaao.add_trace(go.Scatter(
x=title_data['date'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=True, # Initially, make daily traces visible
hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
for title in titles_eeaao:
title_data = df_monthly_eeaao[df_monthly_eeaao['title'] == title]
fig_eeaao.add_trace(go.Scatter(
x=title_data['month'],
y=title_data['daily_count'],
mode='lines+markers',
name=title,
visible=False, # Initially, make monthly traces hidden
hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
text=[title] * len(title_data)
))
# Create buttons for dropdown menu to filter by daily and monthly data
buttons_eeaao = [
dict(
label='Daily',
method='update',
args=[{'visible': ([True] * len(titles_eeaao) + [False] *
len(titles_eeaao))},
{'title': ('Views of Titles Over Time - Everything Everywhere '
'All at Once (Daily)'), 'yaxis': {'type': 'log'}}]
),
dict(
label='Monthly',
method='update',
args=[{'visible': ([False] * len(titles_eeaao) + [True] *
len(titles_eeaao))},
{'title': ('Views of Titles Over Time - Everything Everywhere '
'All at Once (Monthly)'), 'yaxis': {'type': 'linear'}}
]
)
]
# Update layout with dropdown menu
fig_eeaao.update_layout(
title=('Views of Titles Over Time - Everything Everywhere All '
'at Once (Daily)'),
xaxis_title='Date',
yaxis_title='Views',
yaxis=dict(type='log'), # Set initial y-axis type to log for daily data
updatemenus=[dict(
active=0,
buttons=buttons_eeaao,
x=1.15,
y=1.15
)],
template='plotly_white',
legend=dict(traceorder='normal') # Ensure the legend is sorted
)
# Show the figure
fig_eeaao.show()
Table 8. Difference between Average Daily Views per page Before and After the Oscars
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])
# Define the cutoff date
date_cutoff = pd.Timestamp('2023-03-13')
# Define the date range for 7 days before and 7 days after
date_start_before = date_cutoff - pd.Timedelta(days=7)
date_end_after = date_cutoff + pd.Timedelta(days=7)
# Filter data for 7 days before and 7 days after March 13, 2023
df_before = df_daily_eda[(df_daily_eda['date'] >= date_start_before) &
(df_daily_eda['date'] < date_cutoff)]
df_after = df_daily_eda[(df_daily_eda['date'] >= date_cutoff) &
(df_daily_eda['date'] <= date_end_after)]
# Calculate average daily_count for each title 7 days before March 13, 2023
avg_daily_count_before = df_before.groupby('title'
)['daily_count'].mean().reset_index()
avg_daily_count_before.columns = ['title', 'avg_daily_count_before']
# Calculate average daily_count for each title 7 days after March 13, 2023
avg_daily_count_after = df_after.groupby('title'
)['daily_count'].mean().reset_index()
avg_daily_count_after.columns = ['title', 'avg_daily_count_after']
# Add period columns and unify column names for concatenation
avg_daily_count_before['period'] = '7 Days Before March 13, 2023'
avg_daily_count_after['period'] = '7 Days After March 13, 2023'
avg_daily_count_before.columns = ['title', 'avg_daily_count', 'period']
avg_daily_count_after.columns = ['title', 'avg_daily_count', 'period']
# Combine into a single DataFrame
avg_daily_count_period = pd.concat([avg_daily_count_before,
avg_daily_count_after
],
axis=0).reset_index(drop=True)
# Sort by title for better readability
avg_daily_count_period = avg_daily_count_period.sort_values(
by='title').reset_index(drop=True)
# Display the DataFrame
# Set display option to show all rows
pd.set_option('display.max_rows', None)
avg_diff = avg_daily_count_period.pivot(index="title", columns="period",
values="avg_daily_count"
).sort_index(axis=1,
ascending=False).dropna()
avg_diff['Difference'] = (avg_diff["7 Days After March 13, 2023"] -
avg_diff["7 Days Before March 13, 2023"])
avg_diff
| period | 7 Days Before March 13, 2023 | 7 Days After March 13, 2023 | Difference |
|---|---|---|---|
| title | |||
| 95th_Academy_Awards | 65819.285714 | 267900.375 | 202081.089286 |
| A24 | 5447.857143 | 31686.375 | 26238.517857 |
| Adrien_Morot | 69.000000 | 1201.875 | 1132.875000 |
| Alex_Bulkley | 36.285714 | 212.750 | 176.464286 |
| All_Quiet_on_the_Western_Front_(2022_film) | 29047.000000 | 70116.500 | 41069.500000 |
| An_Irish_Goodbye | 9.142857 | 8799.125 | 8789.982143 |
| Avatar:_The_Way_of_Water | 40677.714286 | 42877.875 | 2200.160714 |
| Black_Panther:_Wakanda_Forever | 19906.142857 | 22201.375 | 2295.232143 |
| Brendan_Fraser | 44603.142857 | 324174.625 | 279571.482143 |
| Chandrabose_(lyricist) | 444.285714 | 9161.375 | 8717.089286 |
| Charlie_Mackesy | 496.142857 | 2746.250 | 2250.107143 |
| Chris_Burdon | 9.000000 | 143.000 | 134.000000 |
| Daniel_Barrett_(visual_effects_supervisor) | 15.428571 | 141.625 | 126.196429 |
| Daniel_Roher | 307.428571 | 3651.750 | 3344.321429 |
| Daniels_(directors) | 9489.571429 | 86606.875 | 77117.303571 |
| Edward_Berger | 1421.000000 | 4570.000 | 3149.000000 |
| Eric_Saindon | 27.714286 | 389.625 | 361.910714 |
| Everything_Everywhere_All_at_Once | 72574.428571 | 459437.750 | 386863.321429 |
| Guillermo_del_Toro | 5158.285714 | 20537.125 | 15378.839286 |
| Guillermo_del_Toro's_Pinocchio | 5644.428571 | 17002.500 | 11358.071429 |
| Guneet_Monga | 287.285714 | 21180.750 | 20893.464286 |
| Hauschka | 513.142857 | 3936.875 | 3423.732143 |
| James_Mather_(sound_editor) | 10.142857 | 117.500 | 107.357143 |
| Jamie_Lee_Curtis | 32897.000000 | 238968.125 | 206071.125000 |
| Joe_Letteri | 60.142857 | 607.875 | 547.732143 |
| Kartiki_Gonsalves | 156.714286 | 22581.000 | 22424.285714 |
| Ke_Huy_Quan | 22609.857143 | 319866.875 | 297257.017857 |
| M._M._Keeravani | 2714.714286 | 35578.250 | 32863.535714 |
| Mark_Gustafson | 3.500000 | 5.625 | 2.125000 |
| Michelle_Yeoh | 38678.571429 | 314330.875 | 275652.303571 |
| Miriam_Toews | 2691.428571 | 3095.000 | 403.571429 |
| Navalny_(film) | 2146.714286 | 17584.625 | 15437.910714 |
| Odessa_Rae | 77.000000 | 748.125 | 671.125000 |
| RRR | 122.142857 | 339.500 | 217.357143 |
| Richard_Baneham | 87.285714 | 970.125 | 882.839286 |
| Ruth_E._Carter | 383.000000 | 10038.500 | 9655.500000 |
| Sarah_Polley | 9484.714286 | 39948.125 | 30463.410714 |
| Shane_Boris | 64.428571 | 355.125 | 290.696429 |
| The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film) | 2076.428571 | 7927.375 | 5850.946429 |
| The_Elephant_Whisperers | 1385.428571 | 62919.875 | 61534.446429 |
| The_Whale_(2022_film) | 43692.857143 | 200155.125 | 156462.267857 |
| Top_Gun:_Maverick | 20000.714286 | 26370.000 | 6369.285714 |
| Women_Talking_(film) | 34357.000000 | 32671.250 | -1685.750000 |
Table 9. Summary Statistics of the Difference of Average Daily Views per page Before and After the Oscars
avg_diff.describe()
| period | 7 Days Before March 13, 2023 | 7 Days After March 13, 2023 | Difference |
|---|---|---|---|
| count | 43.000000 | 43.000000 | 43.000000 |
| mean | 11993.104651 | 63578.029070 | 51584.924419 |
| std | 19007.590382 | 113297.582174 | 97780.624236 |
| min | 3.500000 | 5.625000 | -1685.750000 |
| 25% | 82.142857 | 1086.000000 | 609.428571 |
| 50% | 1421.000000 | 17002.500000 | 6369.285714 |
| 75% | 19953.428571 | 41413.000000 | 31663.473214 |
| max | 72574.428571 | 459437.750000 | 386863.321429 |
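The before/after comparison behind Tables 8 and 9 reduces to a grouped mean over a period label. A compact, self-contained sketch of the same idea on toy data (the toy titles, values, and the `period` label are illustrative, not from the study):

```python
import pandas as pd

cutoff = pd.Timestamp('2023-03-13')
# Toy stand-in for df_daily_eda: one week before and one week after the cutoff
df = pd.DataFrame({
    'title': ['A'] * 14 + ['B'] * 14,
    'date': list(pd.date_range('2023-03-06', periods=14)) * 2,
    'daily_count': [10] * 7 + [30] * 7 + [5] * 14,
})
# Label each day relative to the cutoff, then average per title and period
df['period'] = df['date'].map(lambda d: 'after' if d >= cutoff else 'before')
avg = (df.groupby(['title', 'period'])['daily_count'].mean()
         .unstack('period'))
avg['difference'] = avg['after'] - avg['before']
# Title A jumps from 10 to 30 views on average; title B stays flat.
```

The pivot/concat pipeline above produces the same shape of result, with one row per title and a `Difference` column.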
To effectively answer the question, "Does the Oscars have an effect on Wikipedia pageviews?", we want to find how different the pageviews on the days of the Nominations and Awarding are compared to the rest of their respective months. Calculating the z-score of those days against the distribution of pageviews over the rest of that month lets us answer this statistically.
nomination_date = pd.to_datetime('2023-01-24', format='%Y-%m-%d')
awarding_date = pd.to_datetime('2023-03-13', format='%Y-%m-%d')
zscore_dict = {}
for title in category_mapping.keys():
zscore_dict[title] = {}
zscore_dict[title]['Nominations'] = calculate_date_zscore(
df_daily_eda, title, nomination_date)
zscore_dict[title]['Awarding'] = calculate_date_zscore(
df_daily_eda, title, awarding_date)
z_df = pd.DataFrame(zscore_dict).T.dropna().sort_index()
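`calculate_date_zscore` is defined in an earlier part of the notebook; for readers of this section, a minimal sketch of what such a helper could look like, assuming `df` carries `date`, `title`, and `daily_count` columns (the exact original implementation may differ, e.g. in whether the event day is excluded from the baseline):

```python
import pandas as pd

def calculate_date_zscore(df, title, event_date):
    """Z-score of a page's views on event_date relative to the rest
    of that calendar month (event day excluded from the baseline)."""
    in_month = ((df['title'] == title) &
                (df['date'].dt.to_period('M') == event_date.to_period('M')))
    month_views = df.loc[in_month]
    baseline = month_views.loc[month_views['date'] != event_date, 'daily_count']
    event = month_views.loc[month_views['date'] == event_date, 'daily_count']
    if event.empty or len(baseline) < 2 or baseline.std() == 0:
        return None  # not enough data to score this page/date
    return (event.iloc[0] - baseline.mean()) / baseline.std()
```

Pages without enough data return `None`, which the `.dropna()` in the cell above would discard; this is consistent with Table 10 listing fewer pages than Table 8.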
Distribution Histogram
After engaging with the data through a thorough exploratory data analysis, several insights and observations were gained. Initially, a discrepancy was found in the distribution of the data (see Fig. 2). Aside from the discussed limitations, the lack of view history for a page suggested either a lack of traffic or a recently created page. As of the writing of this paper, one of the titles in the Production category, Annemarie Bradley, only began to garner views on March 29, 2023, sixteen days after the Academy Awards. This may be the day their Wikipedia page was created, given their win in their category, or simply when they first gained attention.
Box and Whiskers Plot
Due to the initial findings from the histogram, the next step was to check how the pageviews (daily_count) are distributed across the titles, which resulted in the box and whiskers plot (see Fig. 2). Avatar: The Way of Water topped the overall view counts, and the rest of the plot showed that view counts varied widely across titles, with clear spikes and larger outliers in volume for the movies and actors. This strongly suggested that certain time periods, and the events within them, caused the pageviews of these titles to surge.
Correlation Matrix
The next step was to examine the relationships between and among the different features. The titles themselves were not used; instead, the relationship between the categories and time was observed (see Fig. 3). As mentioned in the earlier section, there was a significantly negative correlation between the Movies category and the Production category. This was most likely due to the boolean nature of those columns: since there were more production-based Oscars to win than movie awards, datapoints were more often part of Production, making the negative relationship between the two categories more prominent.
A noteworthy observation from the analysis is the subtle yet significant negative correlation between the total views a production received and the proportion of those views attributed to production staff members. This finding suggests that as a production gains popularity and accumulates more views overall, the share of those views directed towards the individuals involved in the production process – directors, cinematographers, editors, and other behind-the-scenes contributors – tends to decrease.
The phenomenon found in the correlation matrix aligns with the earlier discovery that the majority of views are typically garnered by the movies themselves and the actors featured in them. As audiences engage with a production, their attention naturally gravitates towards the final product and the on-screen talent, potentially overshadowing the crucial contributions of the individuals who bring the creative vision to life.
Timeseries Graphs
One of the primary concerns of the paper was the effect the Academy Awards would have on its winners. A simple timeseries graph plotting the view count of each title across two years was created as seen above (see Fig. 5), and while it contains a wealth of information, for the purposes of illustration the discussion here focuses on two films and their respective actors and production staff, at least those who won in their categories. First is Everything Everywhere All at Once, the Best Picture winner of the 95th Academy Awards. With the exception of Ke Huy Quan, the actors were receiving monthly views in the hundreds of thousands, which on its own is quite large but small relative to March 13, 2023 (see Fig. 8). This supports the initial assumption of the significant attention the Oscars offer. Ke Huy Quan serves as an excellent example of this effect: prior to the film he was practically outside the industry, while his co-stars were already well-established figures in Hollywood. He went from views in the single digits in 2022 to hitting the highest number of views among all his co-stars, and in the following days received activity similar to theirs as well.
A similar trend was found with The Whale, with both Brendan Fraser and the film itself following a similar pattern in pageviews: both experienced a significant spike in views on the day of its release, beaten only by the day of the Academy Awards (see Fig. 7). Both films experienced a slow rise in views following their nomination, peaked on the day of the Oscars, and then declined rapidly in the days after. Even the average views of each title increased in the days after the Oscars. A notable exception was the film Women Talking, the only title whose average views decreased after the ceremony: one week prior to the Oscars the film averaged about thirty-four thousand views, and afterwards it averaged roughly one thousand seven hundred fewer.
It was also in the examination of The Whale that support was found for some of the observations from the correlation matrix regarding production and view counts. For example, a quick comparison of the views over time of Annemarie Bradley, one of The Whale's make-up artists, showed that despite the bump from their Oscar win, it was significantly smaller than the film's. Even comparing The Whale against its winning production staff members, Judy Chin, Annemarie Bradley, and Adrien Morot, the staff's jump reached the high ten thousands of views against The Whale's and Brendan Fraser's hundreds of thousands, peaking at around three million. This example underscores the observed trend, highlighting the disparity in visibility between on-screen and behind-the-scenes talent.
Z-Score Table
Table 10 below shows the z-score of the pageviews for both nominations and awarding days of each page. Z-scores higher than 3 suggest that the day of the event is an "outlier" or "significantly different" from the other days in the month.
Table 10. Z-Score Table
z_df
| Nominations | Awarding | |
|---|---|---|
| 95th_Academy_Awards | 12.755634 | 30.617676 |
| A24 | 1.718809 | 16.466546 |
| Adrien_Morot | 4.371992 | 31.806244 |
| All_Quiet_on_the_Western_Front_(2022_film) | 5.138306 | 20.339506 |
| An_Irish_Goodbye | 14.171374 | 11.092046 |
| Avatar:_The_Way_of_Water | -0.648066 | 4.052241 |
| Black_Panther:_Wakanda_Forever | 0.348064 | 5.730825 |
| Brendan_Fraser | 1.160336 | 13.663121 |
| Chandrabose_(lyricist) | 0.952919 | 13.609512 |
| Charlie_Mackesy | -0.186678 | 16.400760 |
| Chris_Burdon | 1.134733 | 33.549936 |
| Daniel_Barrett_(visual_effects_supervisor) | 1.083357 | 26.363566 |
| Daniel_Roher | 4.038545 | 28.532929 |
| Daniels_(directors) | 2.784789 | 20.625195 |
| Edward_Berger | 5.997203 | 31.319557 |
| Eric_Saindon | 4.046397 | 4.684864 |
| Everything_Everywhere_All_at_Once | 3.484714 | 16.170656 |
| Guillermo_del_Toro | -0.243263 | 22.288759 |
| Guillermo_del_Toro's_Pinocchio | 0.527514 | 16.944542 |
| Guneet_Monga | 3.841287 | 9.340992 |
| Hauschka | 4.295989 | 38.465112 |
| James_Mather_(sound_editor) | 3.531734 | 23.772698 |
| Jamie_Lee_Curtis | 1.836097 | 18.955645 |
| Joe_Letteri | 4.391260 | 35.033553 |
| Ke_Huy_Quan | 0.159509 | 14.190561 |
| M._M._Keeravani | -0.077411 | 9.824345 |
| Mark_Gustafson | -0.297735 | 4.667252 |
| Michelle_Yeoh | 0.563599 | 14.309144 |
| Miriam_Toews | 3.533487 | 8.961985 |
| Navalny_(film) | 4.863658 | 30.284319 |
| Paul_Rogers_(film_editor) | -1.367073 | 1.716332 |
| RRR | 0.832891 | 14.948348 |
| Richard_Baneham | 12.081255 | 31.630784 |
| Ruth_E._Carter | 3.793557 | 39.153387 |
| Sarah_Polley | 5.697720 | 26.075422 |
| The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film) | 4.366274 | 20.947364 |
| The_Elephant_Whisperers | 3.473158 | 15.055387 |
| The_Whale_(2022_film) | 1.968665 | 10.318191 |
| Top_Gun:_Maverick | 0.667426 | 9.117375 |
| Women_Talking_(film) | 5.701720 | 8.801358 |
Table 11. Z-Score Table Summary Statistics
z_df.describe()
| Nominations | Awarding | |
|---|---|---|
| count | 40.000000 | 40.000000 |
| mean | 3.162344 | 18.745701 |
| std | 3.487761 | 10.268590 |
| min | -1.367073 | 1.716332 |
| 25% | 0.641470 | 10.194730 |
| 50% | 3.128974 | 16.433653 |
| 75% | 4.367704 | 26.905907 |
| max | 14.171374 | 39.153387 |
Based on the summary statistics of the z-score table (Table 11), the Nominations spike was significant on average; however, it is eclipsed by the Awarding spike, which was over 18 standard deviations away from the mean of March 2023. We also see that the standard deviation of the z-scores is greater during the Awarding, which suggests that the individual pages did not spike with the same magnitude.
The winner for Best Costume Design, Ruth E. Carter for Black Panther: Wakanda Forever, got the highest z-score during the awarding ceremony at around 39 standard deviations away, while the lowest spike for the same day belongs to Paul Rogers, the winner of Best Film Editing for Everything Everywhere All at Once. However, note that the z-scores are affected by the mean and standard deviation of the distribution of pageviews for the rest of the month, and not solely by the total number of pageviews on those days.
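This caveat is easy to demonstrate: two pages with identical event-day views can receive very different z-scores depending on how noisy their baseline month is. The numbers below are purely illustrative:

```python
import pandas as pd

def zscore(baseline, event_views):
    """Z-score of an event day against a baseline of daily views."""
    s = pd.Series(baseline)
    return (event_views - s.mean()) / s.std()

steady = [100] * 10 + [110] * 10  # low-variance baseline, mean 105
noisy = [10, 190] * 10            # high-variance baseline, mean 100
z_steady = zscore(steady, 1000)
z_noisy = zscore(noisy, 1000)
# Same 1,000-view spike, but the steadier page scores far higher.
```

This is why, for example, a page with erratic day-to-day traffic can post a large raw spike yet a modest z-score.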
Conclusions and Insights
In summary, the findings of the research show that the Oscars have a significant effect on the Wikipedia views of the parties concerned, specifically the actors, production staff, and the films themselves. The data shows that the actors, whose average views were already quite high, experienced a substantial increase in views on Oscars day. The respective films they were featured in followed a similar trend in their views as well. These findings supported the initial assumption that the Oscars have a significant impact, with average views increasing after the ceremony compared to the week before. However, another aspect of the findings revealed that not all members of the team are recognized with the same level of activity. While actors and films received considerable attention, other production staff members did not experience the same spike in interest. This disparity highlights the varying levels of public recognition within the industry and suggests that further research could explore ways to bring more balanced attention to all contributors to a film.
Recommendations
The insights generated from this data analysis have sufficiently answered the problem stated earlier. However, there are some improvements that can be made by future researchers of this matter.
Of paramount importance is ensuring the proper gathering of the dataset. As found earlier, multiple dates were not included in our analysis due to errors in reading the respective pageview dumps. Additionally, filtering using page titles did not give a complete result for the set range because of possible revisions by Wikipedia contributors. However, there is also a challenge in acquiring and using pageid for filtering, since it will not capture users of the mobile app platform (pageids are null for this access mode).
The use of Spark made it possible to do distributed computing on the large dataset. However, for this study, only one machine with four logical cores was used to manage the dataset. To speed up the process, cluster computing with multiple machines can be employed which will distribute the partitions to more executors.
Future research can also expound on the study by pairing it with clickstream data to understand the source of pageviews. Pageviews coming from neighbor pages can explain the possible connections between pages and show the behavior of Wikipedia users. Additionally, future studies can challenge the assumptions and findings here by exploring the nominees as well as the winners of the Oscars.
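As a starting point for the clickstream suggestion, a small helper could rank where a page's views come from. The column names below follow the general shape of Wikimedia's clickstream dumps (source page, target page, link type, occurrence count), but the exact schema and file layout should be checked against the dump documentation; this is an assumed sketch:

```python
import pandas as pd

def top_inbound_sources(clickstream, target, n=10):
    """Rank the source pages that sent the most traffic to `target`.

    `clickstream` is assumed to have 'source', 'target', and
    'occurrences' columns, one row per (source, target) pair.
    """
    inbound = clickstream[clickstream['target'] == target]
    return (inbound.sort_values('occurrences', ascending=False)
                   .head(n)[['source', 'occurrences']]
                   .reset_index(drop=True))
```

For example, `top_inbound_sources(df_clicks, 'Brendan_Fraser')` would show whether traffic arrived mainly from the 95th_Academy_Awards page, from The_Whale_(2022_film), or from external search.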
References
- Pageview complete dumps. (n.d.). Wikimedia.Org. Retrieved May 15, 2024, from https://dumps.wikimedia.org/other/pageview_complete/readme.html
- Wikipedia contributors. (2024, May 6). 95th Academy Awards. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=95th_Academy_Awards&oldid=1222580018
Acknowledgements
In preparing this technical paper, OpenAI’s ChatGPT was employed to help rephrase sentences and to improve the structure, clarity, and readability of the document. This tool did not serve as a primary source of information but was used to enhance the presentation of the research conducted.
The team would also like to acknowledge Prof. Christian Alis for his mentorship and patience in making this study possible.